Abstract

Integrating the outputs of multiple classifiers via combiners or meta-learners has 
led to substantial improvements in several difficult pattern recognition problems. In 
the typical setting investigated till now, each classifier is trained on data taken or re- 
sampled from a common data set, or (almost) randomly selected subsets thereof, and 
thus experiences similar quality of training data. However, in certain situations where 
data is acquired and analyzed on-line at several geographically distributed locations, the 
quality of data may vary substantially, leading to large discrepancies in performance of 
individual classifiers. In this article we introduce and investigate a family of classifiers 
based on order statistics, for robust handling of such cases. Based on a mathematical 
modeling of how the decision boundaries are affected by order statistic combiners, we 
derive expressions for the reductions in error expected when such combiners are used. 
We show analytically that the selection of the median, the maximum and in general, the
ith order statistic improves classification performance. Furthermore, we introduce the
trim and spread combiners, both based on linear combinations of the ordered classifier 
outputs, and show that they are quite beneficial in presence of outliers or uneven classi- 
fier performance. Experimental results on several public domain data sets corroborate 
these findings. 

1 Introduction 

Since different types of classifiers have different "inductive bias", one does not expect
the generalization performance of two classifiers to be identical [22] for difficult
pattern recognition problems, even when they are both trained on the same data set.
If only the "best" classifier is selected based on an estimation of the true generalization
performance using a finite test set [?], valuable information contained in the results of
the discarded classifiers may be lost. Such potential loss of information can be avoided
if the outputs of all available classifiers are used in the final classification decision. This
concept has received a great deal of attention recently, and many methods for combining
classifier outputs have been proposed [?, 29, ?, ?]. Furthermore, diversity among
classifiers has been actively promoted, by strategies such as bagging [?], arcing [?],
boosting [?], and correlation control [?], as a prelude to combining.

Approaches to pooling classifiers can be separated into two main categories: (i)
simple combiners, e.g., voting [?], Bayesian-based weighted product rule [?], or
averaging [?]; and (ii) meta-learners, such as arbitration or stacking [?].
The simple combining methods are best suited for problems where the individual classifiers
perform the same task, and have comparable success. However, such combiners
are more susceptible to outliers and to unevenly performing classifiers. In the second
category, either sets of combining rules, or full-fledged classifiers acting on the outputs
of the individual classifiers, are constructed [?, 30, ?]. This type of combining is more
general, but is vulnerable to all the problems associated with the added learning (e.g.,
overparameterizing, lengthy training time).

An implicit assumption in most combining schemes is that each classifier sees the 
same training data or resampled versions of the same data. If the individual classifiers 
are then appropriately chosen and trained properly, their performances will be (relatively)
comparable in any region of the problem space. So gains from combining are
derived from the diversity [?] among classifiers rather than by compensating for
weak members of the pool. However, in real life, there are situations where individual
classifiers may not have access to the same data. Such conditions arise in certain data 
mining, sensor fusion and electrical logging (oil services) problems where there are large 
variabilities in the data which is acquired locally and needs to be processed in (near) 
real time at geographically separated places [13]. These conditions create a pool of classifiers
that may have significant variations in their overall performance. Moreover, they
may lead to conditions where individual classifiers have similar average performance, 
but substantially different performance over different parts of the input space. 

In such cases, combining is still desirable, but neither simple combiners nor meta- 
learners are particularly well-suited for the type of problems that arise. For example, 
the simplicity of averaging the classifier outputs is appealing, but the prospect of one 
poor classifier corrupting the combiner makes this a risky choice. Weighted averaging of 
classifier outputs appears to provide some flexibility [?]. Unfortunately, the weights
are still assigned on a per classifier basis rather than a per sample or per class basis. If
a classifier is accurate only in certain areas of the input space, this scheme fails to take
advantage of the variable accuracy of the classifier in question. Using a meta-learner
that provides different weights for different patterns can potentially solve this problem,
but at a considerable cost. In particular, the off-line training of a meta-learner using
a substantial amount of data output by geographically distributed classifiers may not
be feasible. In addition to providing robustness, the order statistic combiners presented
in this work also aim at bridging the gap between simplicity and generality by allowing
the flexible selection of classifiers without the associated cost of training meta-classifiers.

Section 2 summarizes the relationship between classifier errors and decision boundaries
and provides the necessary background for mathematically analyzing order statistic
combiners [58]. Section 3 introduces simple order statistic combiners. Based on these
concepts, in Section 4 we propose two powerful combiners, trim and spread, and derive
the amount of error reduction associated with each. In Section 5 we present the
performance of order statistic combiners on Proben1/UCI benchmarks [43]. Section 6
discusses the implications of using linear combinations of order statistics as a strategy
for pooling the outputs of individual classifiers.



2 Error Characterization in a Single Classifier 

In this section we summarize the approach and results of [58], which quantify the effect
of inaccuracies in estimating a posteriori class probabilities on the classification error
for a single classifier. This background is needed to characterize and understand the 
impact of order statistics combiners, as described in Sections 3 and 4. 

It is well known that, given one-of-L desired outputs and sufficient training samples
reflecting the class priors, the outputs of certain classifiers trained to minimize a mean
square or cross-entropy error criterion approximate the a posteriori probability densities
of the corresponding classes [17, 49]. Based on this result, one can model the ith output
of the mth such classifier as:

$$ f_i^m(x) = p_i(x) + \epsilon_i^m(x), \tag{1} $$

where $p_i(x)$ is the true posterior for the ith class on input x, and $\epsilon_i^m(x)$ is the error of the
mth classifier in estimating that posterior.




Figure 1: Error regions associated with approximating the a posteriori probabilities [58] 

Now, let us decompose the error into two parts: $\epsilon_i^m(x) = \beta_i^m + \eta_i^m(x)$. The first
component does not vary with the input, and provides an offset, or systematic error, for
each class. The second component gives the variability from that systematic error, for
each x in each class, and has zero mean and variance $\sigma^2_{\eta_i^m}$. These two components
of the error are similar to the bias and variance decomposition for a quadratic loss
function given in [?], although they are at the individual input level. We will therefore
refer to classifiers as "biased" and "unbiased", implying $\beta_k^m \neq 0$ for some k, m, and
$\beta_k^m = 0, \forall k, m$, respectively. Let $b^m$ denote the offset between the ideal class boundary,
$x^*$ (based on $p_i(x) = p_j(x)$), and the realized boundary, $x_b^m$ (based on $f_i^m(x) = f_j^m(x)$),
as shown in Figure 1. This boundary offset ($b^m = x_b^m - x^*$) has mean and variance
given respectively by:

$$ \beta^m = \frac{\beta_i^m - \beta_j^m}{s} \tag{2} $$

and

$$ \sigma^2_{b^m} = \frac{\sigma^2_{\eta_i^m} + \sigma^2_{\eta_j^m}}{s^2}, \tag{3} $$

where $s = p'_j(x^*) - p'_i(x^*)$, as introduced in [58].

1 This and other related papers can be downloaded from URL http://www.lans.ece.utexas.edu.

Let us further denote the probability density function of this boundary offset by
$f_b(x)$. The expected model error associated with the selection of a particular classifier
m can then be expressed as:

$$ E^m_{model} = \int_{-\infty}^{\infty} A(b) f_b(b) \, db, \tag{4} $$

where $A(b) = \int_{x^*}^{x^*+b} (p_j(x) - p_i(x)) \, dx$ is the error due to the selection of a particular
decision boundary. In general, it is not possible to obtain the density function for
the boundary offset without making assumptions on the distributions of the errors.
However, a first order approximation, derived in [58], leads to:

$$ E^m_{model} = \int_{-\infty}^{\infty} \frac{1}{2} b^2 s f_b(b) \, db. \tag{5} $$
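The first order approximation behind Equation 5 can be sketched as follows. This is a reconstruction under the assumption that the posteriors are expanded linearly around $x^*$ and that $p_i(x^*) = p_j(x^*)$ at the ideal boundary; the integration variable u is introduced here for illustration only.

```latex
\begin{align*}
p_j(x^* + u) - p_i(x^* + u)
  &\approx \big(p_j(x^*) - p_i(x^*)\big) + \big(p'_j(x^*) - p'_i(x^*)\big)\,u = s\,u, \\
A(b) = \int_{x^*}^{x^*+b} \big(p_j(x) - p_i(x)\big)\,dx
  &\approx \int_0^b s\,u\,du = \tfrac{1}{2}\, s\, b^2 .
\end{align*}
```

Substituting this expression for A(b) into Equation 4 gives the integrand of Equation 5.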
Let us define the first and second moments of the boundary offset as follows:

$$ M_1 = \int_{-\infty}^{\infty} x f_b(x) \, dx \quad \text{and} \quad M_2 = \int_{-\infty}^{\infty} x^2 f_b(x) \, dx. $$

If the individual classifiers are unbiased, the offset $b^m$ of a single classifier has $M_1 = 0$
and $M_2 = \sigma^2_{b^m}$, leading to:

$$ E^m_{model} = \frac{s M_2}{2} = \frac{s \sigma^2_{b^m}}{2}. \tag{6} $$

Now, if the classifiers are biased, the variance of $b^m$ is left unchanged (given by Equation 3),
but the mean becomes $\beta^m$. In other words, we have $M_1 = \beta^m$ and
$\sigma^2_{b^m} = M_2 - M_1^2$, leading to the following model error:

$$ E^m_{model}(\beta) = \frac{s M_2}{2} = \frac{s}{2} \left( \sigma^2_{b^m} + (\beta^m)^2 \right). \tag{7} $$

To emphasize the distinction between biased and unbiased classifiers, the model error 
will be given as a function of /3 for biased classifiers. A more detailed derivation of class 
boundaries and error regions is presented in [58]. For analyzing the error regions after
combining and comparing them to the single classifier case, one needs to determine how
the first and second moments of the boundary distributions are affected by combining.
The following sections focus on obtaining those values for various combiners. 
3 Combining Multiple Classifiers through Order Statistics



3.1 Basic Concepts 

In this section, we briefly discuss some basic concepts and properties of order statistics. 
Let X be a random variable with probability density function $f_X(\cdot)$, and cumulative
distribution function $F_X(\cdot)$. Let $(X_1, X_2, \ldots, X_N)$ be a random sample drawn from this
distribution. Now, let us arrange them in non-decreasing order, providing:

$$ X_{1:N} \le X_{2:N} \le \cdots \le X_{N:N}. $$

The ith order statistic, denoted by $X_{i:N}$, is the ith value in this progression. The cumulative
distribution function for the smallest and largest order statistic can be obtained
by noting that:

$$ F_{X_{N:N}}(x) = P(X_{N:N} \le x) = \prod_{i=1}^{N} P(X_i \le x) = [F_X(x)]^N $$

and:

$$ F_{X_{1:N}}(x) = P(X_{1:N} \le x) = 1 - P(X_{1:N} > x) = 1 - \prod_{i=1}^{N} P(X_i > x) = 1 - \prod_{i=1}^{N} \left( 1 - P(X_i \le x) \right) = 1 - [1 - F_X(x)]^N. $$

The corresponding probability density functions can be obtained from these equations.
In general, for the ith order statistic, the cumulative distribution function gives the
probability that at least i of the chosen X's are less than or equal to x. The probability
density function of $X_{i:N}$ is then given by [?]:

$$ f_{X_{i:N}}(x) = \frac{N!}{(i-1)! \, (N-i)!} [F_X(x)]^{i-1} [1 - F_X(x)]^{N-i} f_X(x). \tag{8} $$

This general form, however, cannot always be computed in closed form. Therefore,
obtaining the expected value of a function of x using Equation 8 is not always possible.
However, the first two moments of the density function are widely available for a variety
of distributions [?]. These moments can be used to compute the expected values of
certain specific functions, e.g., polynomials of order at most two.
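When tabulated moments are not at hand, the density of the ith order statistic can be integrated numerically. The following is a minimal sketch, assuming i.i.d. standard normal samples and a plain trapezoidal rule; the function names, integration limits and step count are illustrative choices, not part of the original analysis.

```python
import math


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def order_stat_pdf(x, i, n):
    """Density of the i-th order statistic of n i.i.d. standard normal samples."""
    coeff = math.factorial(n) / (math.factorial(i - 1) * math.factorial(n - i))
    cdf = normal_cdf(x)
    return coeff * cdf ** (i - 1) * (1.0 - cdf) ** (n - i) * normal_pdf(x)


def order_stat_moments(i, n, lo=-8.0, hi=8.0, steps=16000):
    """Return (mean, variance) of X_{i:n} by trapezoidal integration."""
    h = (hi - lo) / steps
    m1 = 0.0
    m2 = 0.0
    for j in range(steps + 1):
        x = lo + j * h
        weight = 0.5 if j in (0, steps) else 1.0
        f = order_stat_pdf(x, i, n)
        m1 += weight * x * f
        m2 += weight * x * x * f
    m1 *= h
    m2 *= h
    return m1, m2 - m1 * m1


if __name__ == "__main__":
    for n, i in [(2, 1), (3, 2), (5, 3)]:
        mean, var = order_stat_moments(i, n)
        print(f"N={n}, i={i}: mean={mean:.4f}, variance={var:.4f}")
```

For unit-variance samples, the variance returned here is the quantity that is tabulated for Gaussian order statistics; for example, the smaller of two samples has variance 1 - 1/pi, roughly 0.682.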



3.2 Combining Unbiased Classifiers through Order Statistics 

Now, let us turn our attention to order statistics (OS) combiners. For a given input x, 
let the network outputs of each of the N classifiers for each class i be ordered in the 
following manner: 

$$ f_i^{1:N}(x) \le f_i^{2:N}(x) \le \cdots \le f_i^{N:N}(x). $$

Then one constructs the kth order statistic combiner, by selecting the kth ranked output
for each class ($f_i^{k:N}(x)$), as representing its posterior [?].

In particular, max, med and min combiners are defined as follows:

$$ f_i^{max}(x) = f_i^{N:N}(x), \tag{9} $$

$$ f_i^{med}(x) = \begin{cases} \dfrac{f_i^{N/2:N}(x) + f_i^{N/2+1:N}(x)}{2} & \text{if } N \text{ is even} \\ f_i^{(N+1)/2:N}(x) & \text{if } N \text{ is odd,} \end{cases} \tag{10} $$

$$ f_i^{min}(x) = f_i^{1:N}(x). \tag{11} $$
These three combiners are relevant because they represent important qualitative interpretations
of the output space. Selecting the maximum combiner is equivalent to
selecting the class with the highest posterior. Indeed, since the network outputs approx- 
imate the class a posteriori distributions, selecting the maximum reduces to selecting 
the classifier that is the most "certain" of its decision. The drawback of this method 
however is that it can be compromised by a single classifier that repeatedly provides 
high values. The selection of the minimum combiner follows a similar logic, but focuses 
on classes that are unlikely to be correct, rather than on the correct class. Thus, this 
combiner eliminates less likely classes by basing the decision on the lowest value for a 
given class. This combiner suffers from the same ills as the max combiner. However, it
is less dependent on a single error, since it performs a min-max operation, rather than
a max-max (see footnote 2). The median combiner on the other hand considers the most "typical"
representation of each class. For highly noisy data, this combiner is more desirable than
either the min or max combiners since the decision is not compromised as much by a 
single large error. 
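The three combiners can be sketched in a few lines. This is an illustrative implementation only: it assumes the classifier outputs are available as a list of per-classifier lists (one value per class), and the function names and the example numbers are made up for the illustration.

```python
def os_combine(outputs, k):
    """k-th order statistic combiner (k is 1-based).

    outputs[m][i] is the output of classifier m for class i.
    Returns one combined value per class.
    """
    n_classes = len(outputs[0])
    combined = []
    for i in range(n_classes):
        ranked = sorted(clf[i] for clf in outputs)
        combined.append(ranked[k - 1])
    return combined


def med_combine(outputs):
    """Median combiner: middle ranked value, or mean of the two middle values."""
    n = len(outputs)
    if n % 2 == 1:
        return os_combine(outputs, (n + 1) // 2)
    lower = os_combine(outputs, n // 2)
    upper = os_combine(outputs, n // 2 + 1)
    return [(a + b) / 2.0 for a, b in zip(lower, upper)]


def decide(combined):
    """Assign the pattern to the class with the highest combined output."""
    return max(range(len(combined)), key=lambda i: combined[i])


if __name__ == "__main__":
    # Two classifiers favour class 0; the third is confidently wrong.
    outputs = [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]
    n = len(outputs)
    print("max:", decide(os_combine(outputs, n)))
    print("med:", decide(med_combine(outputs)))
    print("min:", decide(os_combine(outputs, 1)))
```

In this toy example the single overconfident classifier determines the max decision, while the median follows the two classifiers that agree, which is the behaviour described above.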

The analysis that follows does not depend on the particular order statistic chosen.
Therefore, we will denote all OS combiners by $f_k^{os}(x)$ and derive the model error, $E^{os}_{model}$.
The network output provided by $f_k^{os}(x)$ is given by:

$$ f_k^{os}(x) = p_k(x) + \epsilon_k^{os}(x). \tag{12} $$

Let us first investigate the zero-bias case ($\beta_k = 0, \forall k$), where we get $\epsilon_k^{os}(x) = \eta_k^{os}(x)$.
Proceeding as in Section 2, the boundary offset $b^{os}$ is shown to be:

$$ b^{os} = \frac{\eta_i^{os}(x_b) - \eta_j^{os}(x_b)}{s}. \tag{13} $$

For i.i.d. $\eta_k$'s, the first two moments will be identical for each class. Moreover, taking
the order statistic will shift the mean of both $\eta_i^{os}$ and $\eta_j^{os}$ by the same amount, leaving
the mean of the difference unaffected. Therefore, $b^{os}$ will have zero mean, and variance:

$$ \sigma^2_{b^{os}} = \frac{\sigma^2_{\eta_i^{os}} + \sigma^2_{\eta_j^{os}}}{s^2} = \frac{2 \alpha \sigma^2_{\eta_k^m}}{s^2} = \alpha \sigma^2_{b^m}, \tag{14} $$

where $\alpha$ is a reduction factor that depends on the order statistic and on the distribution
of b. For most distributions, $\alpha$ can be found in tabulated form [?]. For example, Table 1
provides $\alpha$ values for all order statistic combiners, up to 10 classifiers, for a Gaussian
distribution [?, 50]. (Because this distribution is symmetric, the $\alpha$ values of l and k
where l + k = N + 1 are identical, and listed in parentheses.)

Returning to the error calculation, we have $M_1^{os} = 0$ and $M_2^{os} = \sigma^2_{b^{os}}$, providing:

$$ E^{os}_{model} = \frac{s M_2^{os}}{2} = \frac{s \sigma^2_{b^{os}}}{2} = \frac{s \alpha \sigma^2_{b^m}}{2} = \alpha E^m_{model}. \tag{15} $$

Equation 15 shows that the reduction in the error due to using the OS combiner 
instead of the mth classifier is directly related to the reduction in the variance of the 
boundary offset b. Since the means and variances of order statistics for a variety of dis- 
tributions are widely available in tabular form, the reductions can be readily quantified. 
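As a worked illustration of this proportionality, the snippet below looks up the Gaussian reduction factor for the kth of N order statistics and scales a single-classifier model error by it. The dictionary holds the Table 1 values; the helper names and the example error value of 0.10 are hypothetical.

```python
# Reduction factors for the Gaussian distribution (Table 1), indexed by N.
# Only k <= (N + 1) / 2 is stored; the factor for k equals that for N + 1 - k.
ALPHA_GAUSS = {
    1: [1.00],
    2: [0.682],
    3: [0.560, 0.449],
    4: [0.492, 0.360],
    5: [0.448, 0.312, 0.287],
    6: [0.416, 0.280, 0.246],
    7: [0.392, 0.257, 0.220, 0.210],
    8: [0.373, 0.239, 0.201, 0.187],
    9: [0.357, 0.226, 0.186, 0.171, 0.166],
    10: [0.344, 0.215, 0.175, 0.158, 0.151],
}


def alpha(k, n):
    """Reduction factor of the k-th order statistic out of n (1-based k)."""
    k = min(k, n + 1 - k)  # symmetry of the Gaussian distribution
    return ALPHA_GAUSS[n][k - 1]


def os_model_error(single_model_error, k, n):
    """Model error of the k-th OS combiner for unbiased i.i.d. classifiers."""
    return alpha(k, n) * single_model_error


if __name__ == "__main__":
    single = 0.10  # hypothetical model error of one classifier
    print("median of 5:", os_model_error(single, 3, 5))
    print("max of 5:   ", os_model_error(single, 5, 5))
```

Under these assumptions, the median of five unbiased classifiers retains 28.7% of the single-classifier model error, while the max (or min) retains 44.8%.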



2 Recall that the pattern is ultimately assigned to the class with the highest combined output. 






Table 1: Reduction factors α for the Gaussian distribution, based on [50]. The value of k
in parentheses is the symmetric rank N + 1 - k, which has the same factor.

 N    k        α         N    k        α         N    k        α
 1    1       1.00       6    1 (6)   .416       9    1 (9)   .357
 2    1 (2)   .682            2 (5)   .280            2 (8)   .226
 3    1 (3)   .560            3 (4)   .246            3 (7)   .186
      2       .449       7    1 (7)   .392            4 (6)   .171
 4    1 (4)   .492            2 (6)   .257            5       .166
      2 (3)   .360            3 (5)   .220      10    1 (10)  .344
 5    1 (5)   .448            4       .210            2 (9)   .215
      2 (4)   .312       8    1 (8)   .373            3 (8)   .175
      3       .287            2 (7)   .239            4 (7)   .158
                              3 (6)   .201            5 (6)   .151
                              4 (5)   .187

3.3 Combining Biased Classifiers through Order Statistics 

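In this section, we analyze the error regions for biased classifiers. Let us return our
attention to $b^{os}$. First, note that the error terms can no longer be studied separately,
since in general $(a + b)^{os} \neq a^{os} + b^{os}$. We will therefore need to specify the mean and
variance of the result of each operation (see footnote 3). Equation 13 becomes:

$$ b^{os} = \frac{(\beta_i + \eta_i(x_b))^{os} - (\beta_j + \eta_j(x_b))^{os}}{s}. \tag{16} $$

Let $\bar{\beta}_k = \frac{1}{N} \sum_{m=1}^{N} \beta_k^m$ be the mean of classifier biases. Since the $\eta_k^m$'s have zero mean,
$\beta_k + \eta_k(x_b)$ has first moment $\bar{\beta}_k$ and variance $\sigma^2_{\eta_k^m} + \sigma^2_{\beta_k^m}$, with $\sigma^2_{\beta_k^m} = E[(\beta_k^m)^2] - \bar{\beta}_k^2$,
where $E[\cdot]$ denotes the expected value operator.

Taking a specific order statistic of this expression will modify both moments. The
first moment is given by $\bar{\beta}_k + \mu^{os}$, where $\mu^{os}$ is a shift which depends on the order
statistic chosen, but not on the class. Then, the first moment of $b^{os}$ is given by:

$$ \frac{(\bar{\beta}_i + \mu^{os}) - (\bar{\beta}_j + \mu^{os})}{s} = \frac{\bar{\beta}_i - \bar{\beta}_j}{s} = \bar{\beta}. \tag{17} $$

Note that the bias term represents an "average bias" since the contributions due to the
order statistic are removed. Therefore, reductions in bias cannot be obtained from a
table similar to Table 1.

Now, let us turn our attention to the variance. Since $\beta_k^m + \eta_k^m(x_b)$ has variance
$\sigma^2_{\eta_k^m} + \sigma^2_{\beta_k^m}$, it follows that $(\beta_k + \eta_k(x_b))^{os}$ has variance $\sigma^2_{\eta_k^{os}} = \alpha (\sigma^2_{\eta_k^m} + \sigma^2_{\beta_k^m})$, where $\alpha$
is the factor discussed in Section 3.2. Therefore, the variance of $b^{os}$ is given by:

$$ \sigma^2_{b^{os}} = \frac{\sigma^2_{\eta_i^{os}} + \sigma^2_{\eta_j^{os}}}{s^2} = \frac{2 \alpha \sigma^2_{\eta_k^m}}{s^2} + \frac{\alpha (\sigma^2_{\beta_i^m} + \sigma^2_{\beta_j^m})}{s^2} = \alpha (\sigma^2_{b^m} + \sigma^2_{\beta^m}), \tag{18} $$

where $\sigma^2_{\beta^m} = \frac{\sigma^2_{\beta_i^m} + \sigma^2_{\beta_j^m}}{s^2}$ is the variance introduced by the systematic errors of different
classifiers.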


3 Since the exact distribution parameters of $b^{os}$ are not known, we use the sample mean and the sample
variance.
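We have now obtained the first and second moments of $b^{os}$, and can compute the
model error. Namely, we have $M_1^{os} = \bar{\beta}$ and $\sigma^2_{b^{os}} = M_2^{os} - (M_1^{os})^2$, leading to:

$$ E^{os}_{model}(\beta) = \frac{s}{2} M_2^{os} = \frac{s}{2} \left( \sigma^2_{b^{os}} + \bar{\beta}^2 \right) \tag{19} $$

$$ = \frac{s}{2} \left( \alpha (\sigma^2_{b^m} + \sigma^2_{\beta^m}) + \bar{\beta}^2 \right). \tag{20} $$

The reduction in the error is more difficult to assess in this case. By writing the error
as:

$$ E^{os}_{model}(\beta) = \frac{s}{2} \left( \alpha (\sigma^2_{b^m} + (\beta^m)^2) + \alpha \sigma^2_{\beta^m} + \bar{\beta}^2 - \alpha (\beta^m)^2 \right), $$

we get:

$$ E^{os}_{model}(\beta) = \alpha E^m_{model}(\beta) + \frac{s}{2} \left( \alpha \sigma^2_{\beta^m} + \bar{\beta}^2 - \alpha (\beta^m)^2 \right). \tag{21} $$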

Analyzing the error reduction in the general case requires knowledge about the bias 
introduced by each classifier. Unlike regression problems where the bias and variance 
contributions to the error are additive and well-understood, in classification problems 
their interaction is more complex [?]. Indeed, it has been observed that ensemble
methods do more than simply reduce the variance [?].

Based on these observations and Equation 21, let us analyze extreme cases. For
example, if each classifier has the same bias, $\sigma^2_{\beta^m}$ is reduced to zero and $\bar{\beta} = \beta^m$. In this
case the error reduction can be expressed as:

$$ E^{os}_{model}(\beta) = \frac{s}{2} \left( \alpha \sigma^2_{b^m} + (\beta^m)^2 \right), $$

where $\alpha$ balances the two contributions to the error. A small value for $\alpha$ will reduce the
first component of the error (mainly variance), while leaving the second term untouched.
The net effect will be very similar to results obtained for regression problems. In
this case, it is important to reduce classifier bias before combining (e.g., by using an
overparametrized model).

If on the other hand, the biases produce a zero mean variable, we obtain $\bar{\beta} = 0$. In
this case, the model error becomes:

$$ E^{os}_{model}(\beta) = \alpha E^m_{model}(\beta) + \frac{s \alpha}{2} \left( \sigma^2_{\beta^m} - (\beta^m)^2 \right) $$

and the error reduction will be significant if the second term is small or negative. In
fact, if the variation among the biases is small relative to their magnitude, the error will
be reduced more than in the unbiased cases. If however, the variation is large compared
to the magnitude, the error reduction will be minimal. Furthermore, if $\alpha$ is large and
the biases are small and highly varied, it is possible for this combiner to do worse than
the individual classifiers, which is a danger not present for regression problems. This
observation very closely parallels results reported in [?].
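The two extreme cases can be checked numerically. The sketch below evaluates the single-classifier error of Equation 7 and the order statistic combiner error of Equation 20; the function names and all numeric values (s, variances, biases) are hypothetical and chosen only to illustrate the two regimes.

```python
def single_model_error(s, var_b, beta_m):
    """Model error of one biased classifier: (s / 2) * (var_b + beta_m^2)."""
    return 0.5 * s * (var_b + beta_m ** 2)


def os_model_error_biased(s, alpha, var_b, var_beta, beta_bar):
    """Model error of an OS combiner over biased classifiers.

    alpha    : reduction factor of the chosen order statistic
    var_b    : boundary offset variance of a single classifier
    var_beta : variance introduced by the differing classifier biases
    beta_bar : average bias remaining after combining
    """
    return 0.5 * s * (alpha * (var_b + var_beta) + beta_bar ** 2)


if __name__ == "__main__":
    s = 1.0

    # Case 1: every classifier has the same bias (var_beta = 0, beta_bar = beta_m).
    same_bias = os_model_error_biased(s, 0.287, 0.04, 0.0, 0.1)
    print("same bias:     ", same_bias, "vs single", single_model_error(s, 0.04, 0.1))

    # Case 2: zero-mean biases that are small but highly varied, with a large alpha.
    varied = os_model_error_biased(s, 0.682, 0.04, 0.03, 0.0)
    print("varied biases: ", varied, "vs single", single_model_error(s, 0.04, 0.05))
```

With these made-up numbers the first case reduces only the variance part of the error, and the second case yields a combiner error above that of an individual classifier, which is the danger noted above.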

4 Linear Combining of Ordered Classifier Outputs 

In the previous section, we derived error reductions when the class posteriors are directly 
estimated through the ordered classifier outputs. Since simple averaging has also been 
shown to provide benefits, in this section, we investigate the combinations of averaging 
and order statistics for pooling classifier outputs. 
4.1 Spread Combiner

The first linear combination of ordered classifier outputs we study focuses on extrema.
As discussed in Section 3.2, the maximum and minimum of a set of classifier outputs
carry specific meanings. Indeed, the maximum can be viewed as the class for which
there is the most evidence. Similarly, the minimum deletes classes with little evidence.
To prevent a single classifier from having too large an impact on the eventual
output, these two values can be averaged to yield the spread combiner. This combiner
strikes a balance between the positive and negative evidence, leading to a more robust
combiner than either of them.
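The averaging of the two extreme outputs can be sketched as follows. The data layout (a list of per-classifier output lists), the function names and the example values are assumptions made for the illustration.

```python
def spread_combine(outputs):
    """Spread combiner: per class, average the smallest and largest output.

    outputs[m][i] is the output of classifier m for class i.
    """
    n_classes = len(outputs[0])
    combined = []
    for i in range(n_classes):
        values = [clf[i] for clf in outputs]
        combined.append(0.5 * (min(values) + max(values)))
    return combined


def decide(combined):
    """Assign the pattern to the class with the highest combined output."""
    return max(range(len(combined)), key=lambda i: combined[i])


if __name__ == "__main__":
    outputs = [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]
    combined = spread_combine(outputs)
    print(combined, "-> class", decide(combined))
```

Only the two extreme values per class enter the result, so intermediate classifier outputs have no influence on the decision.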

4.1.1 Spread Combiner for Unbiased Classifiers: 

For a classifier without bias, the spread combiner is formally defined as:

$$ f_i^{spr}(x) = \frac{1}{2} \left( f_i^{1:N}(x) + f_i^{N:N}(x) \right) = p_i(x) + \eta_i^{spr}(x), \tag{22} $$

where:

$$ \eta_i^{spr}(x) = \frac{1}{2} \left( \eta_i^{1:N}(x) + \eta_i^{N:N}(x) \right). $$

The variance of $\eta_i^{spr}(x)$ is given by:

$$ \sigma^2_{\eta_i^{spr}} = \frac{1}{4} \sigma^2_{\eta_i^{1:N}(x)} + \frac{1}{4} \sigma^2_{\eta_i^{N:N}(x)} + \frac{1}{2} cov(\eta_i^{1:N}(x), \eta_i^{N:N}(x)), \tag{23} $$

where $cov(\cdot, \cdot)$ represents the covariance between two variables (even when the $\eta_i$'s are
independent, ordering introduces correlations). Note that because of the ordering, the
variances in the first two terms of Equation 23 can be expressed in terms of the individual
classifier variances. Furthermore, the covariance between two order statistics can also
be determined in tabulated form for given distributions. Table 2 provides these values
for a Gaussian distribution based on [50]. This expression can be further simplified for
symmetric distributions where $\sigma^2_{1:N} = \sigma^2_{N:N}$ (e.g., Gaussian noise model) and leads to:

$$ \sigma^2_{\eta_i^{spr}} = \frac{1}{2} \left( \alpha_{1:N} + B_{1,N:N} \right) \sigma^2_{\eta_i^m(x)}, \tag{24} $$

where $\alpha_{m:N}$ is the variance of the mth ordered sample and $B_{m,l:N}$ is the covariance
between the mth and lth ordered samples, given that the initial samples had unit
variance [50]. Because this is a symmetric distribution, the B values are also symmetric
(e.g., $B_{1,2:5} = B_{4,5:5}$).

Then, using Equation 3, the variance of the boundary offset $b^{spr}$ can be calculated:

$$ \sigma^2_{b^{spr}} = \frac{\sigma^2_{\eta_i^{spr}} + \sigma^2_{\eta_j^{spr}}}{s^2} = \frac{1}{2} \left( \alpha_{1:N} + B_{1,N:N} \right) \sigma^2_{b^m}. \tag{25} $$

Finally, through Equation 6, we can obtain the reduction in the model error due to the
spread combiner:

$$ \frac{E^{spr}_{model}}{E^m_{model}} = \frac{\alpha_{1:N} + B_{1,N:N}}{2}. \tag{26} $$
Table 2: Some reduction factors B for the Gaussian distribution, based on [50]. Each
entry gives the pair k,l followed by B for the kth and lth ordered samples out of N.

 N    k,l  B      k,l  B      k,l  B      k,l  B      k,l  B      k,l  B
 2    1,2 .318
 3    1,2 .276    1,3 .165
 4    1,2 .246    1,3 .158    1,4 .105    2,3 .236
 5    1,2 .224    1,3 .148    1,4 .106    1,5 .074    2,3 .208    2,4 .150
 6    1,2 .209    1,3 .139    1,4 .102    1,5 .077    1,6 .056
      2,3 .189    2,4 .140    2,5 .106    3,4 .183
 7    1,2 .196    1,3 .132    1,4 .099    1,5 .077    1,6 .060    1,7 .045
      2,3 .175    2,4 .131    2,5 .102    2,6 .080    3,4 .166    3,5 .130
 8    1,2 .186    1,3 .126    1,4 .095    1,5 .075    1,6 .060    1,7 .048
      1,8 .037    2,3 .163    2,4 .123    2,5 .098    2,6 .079    2,7 .063
      3,4 .152    3,5 .121    3,6 .098    4,5 .149
 9    1,2 .178    1,3 .121    1,4 .091    1,5 .073    1,6 .059    1,7 .049
      1,8 .040    1,9 .031    2,3 .154    2,4 .117    2,5 .093    2,6 .077
      2,7 .063    2,8 .052    3,4 .142    3,5 .114    3,6 .093    3,7 .077
      4,5 .137    4,6 .113

Based on Equation 26 and Tables 1 and 2, Table 3 displays the error reductions provided
by the spread combiner for a Gaussian noise model. For comparison purposes, the error
reduction for the min and max combiners is also provided. Note that for the Gaussian
distribution, the error reduction of min is equal to that of max.
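The spread reduction factors can be recomputed from the tabulated Gaussian values as half the sum of the variance factor of the extreme order statistic and the covariance between the first and Nth ordered samples. The sketch below does this for N = 2 to 9, using the entries of Tables 1 and 2; the dictionary and function names are illustrative.

```python
# Variance factor of the first (or N-th) ordered sample, from Table 1.
ALPHA_1N = {2: 0.682, 3: 0.560, 4: 0.492, 5: 0.448,
            6: 0.416, 7: 0.392, 8: 0.373, 9: 0.357}

# Covariance between the first and N-th ordered samples, from Table 2.
B_1N = {2: 0.318, 3: 0.165, 4: 0.105, 5: 0.074,
        6: 0.056, 7: 0.045, 8: 0.037, 9: 0.031}


def spread_factor(n):
    """Error reduction factor of the spread combiner for n unbiased classifiers."""
    return 0.5 * (ALPHA_1N[n] + B_1N[n])


if __name__ == "__main__":
    print(" N   spread   min or max")
    for n in sorted(ALPHA_1N):
        print(f"{n:2d}   {spread_factor(n):.4f}   {ALPHA_1N[n]:.3f}")
```

The computed values agree with the spread column of Table 3 up to rounding in the third decimal place.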

Table 3: Error Reduction Factors for the Spread, min and max Combiners with Gaussian
Noise Model.

  N    spread   min or max
  2    .500     .682
  3    .362     .560
  4    .299     .492
  5    .261     .448
  6    .236     .416
  7    .219     .392
  8    .205     .373
  9    .194     .357
 10    .186     .344



4.1.2 Spread Combiner for Biased Classifiers: 

Now, if the classifier biases are non-zero, the spread combiner's output is given by:

$$ f_i^{spr}(x) = \frac{1}{2} \left( f_i^{1:N}(x) + f_i^{N:N}(x) \right) = p_i(x) + (\beta_i + \eta_i(x))^{spr}. \tag{27} $$

In that case, the boundary offset is given by:

$$ b^{spr} = \frac{(\beta_i + \eta_i(x_b))^{spr} - (\beta_j + \eta_j(x_b))^{spr}}{s}, \tag{28} $$

which after expanding each term and regrouping can be expressed as:

$$ b^{spr} = \frac{(\beta_i + \eta_i(x_b))^{1:N} - (\beta_j + \eta_j(x_b))^{1:N}}{2s} + \frac{(\beta_i + \eta_i(x_b))^{N:N} - (\beta_j + \eta_j(x_b))^{N:N}}{2s}. \tag{29} $$

The first moment of $b^{spr}$ can be obtained by analyzing each term of Equation 29.
In fact, the offset introduced by the first and Nth order statistic for classes i and j
will cancel each other out, leaving only the average bias between the min and max
components of the error (as in Equation 17), given by $\bar{\beta}^{spr} = \frac{\bar{\beta}^{1:N} + \bar{\beta}^{N:N}}{2}$.

The variance of $b^{spr}$ needs to be derived from Equation 29. Proceeding as in Equation 18,
the variance of the spread combiner can be expressed as:

$$ \sigma^2_{b^{spr}} = \left( \frac{1}{4} \alpha_{1:N} + \frac{1}{4} \alpha_{N:N} + \frac{1}{2} B_{1,N:N} \right) (\sigma^2_{b^m} + \sigma^2_{\beta^m}). \tag{30} $$

For a symmetric distribution (where $\alpha_{1:N} = \alpha_{N:N}$), we obtain the following error:

$$ E^{spr}_{model}(\beta) = \frac{s}{4} (\alpha_{1:N} + B_{1,N:N}) (\sigma^2_{b^m} + \sigma^2_{\beta^m}) + \frac{s}{2} (\bar{\beta}^{spr})^2, \tag{31} $$

which is very similar to Equation 21, where the value of $\alpha$ for a single order statistic is
now replaced by $(\alpha_{1:N} + B_{1,N:N})/2$, since the mean of the first and Nth order statistic is used
in the posterior estimate.

4.2 Trimmed Means 

Instead of actively using the extreme values as was the case with the spread combiner,
one can base the posterior estimate around the median values. However, instead of
selecting one classifier output as was done for $f^{med}$, one can use multiple classifiers
whose outputs are "typical." In this scheme, only a certain fraction of all available
classifiers are used for a given pattern. The main advantage of this method over weighted
averaging is that the set of classifiers which contribute to the combiner varies from pattern
to pattern. Furthermore, they do not need to be determined externally, but are a
function of the current pattern and the classifier responses to that pattern.
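A minimal sketch of this scheme follows: for each class, the outputs are ranked and only those with ranks N1 through N2 are averaged. The data layout, the function names and the example values are assumptions made for the illustration.

```python
def trim_combine(outputs, n1, n2):
    """Trimmed mean combiner over ranks n1..n2 (1-based, inclusive).

    outputs[m][i] is the output of classifier m for class i. The ranking is
    done separately for each class, so the classifiers that contribute can
    differ from class to class and from pattern to pattern.
    """
    count = n2 - n1 + 1
    n_classes = len(outputs[0])
    combined = []
    for i in range(n_classes):
        ranked = sorted(clf[i] for clf in outputs)
        combined.append(sum(ranked[n1 - 1:n2]) / count)
    return combined


def ave_combine(outputs):
    """Plain averaging combiner, for comparison."""
    n = len(outputs)
    return [sum(clf[i] for clf in outputs) / n for i in range(len(outputs[0]))]


if __name__ == "__main__":
    # Four classifiers roughly agree; the fifth is a faulty outlier.
    outputs = [[0.80, 0.20], [0.70, 0.30], [0.75, 0.25], [0.65, 0.35], [0.00, 1.00]]
    print("trim:", trim_combine(outputs, 2, 4))
    print("ave: ", ave_combine(outputs))
```

In this example the outlier pulls the plain average toward the wrong class, while the trimmed mean discards it because it lands at an extreme rank for both classes.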

4.2.1 Trimmed Mean Combiner for Unbiased Classifiers: 

Let us formally define the trimmed mean combiner ($\beta_k = 0, \forall k$) as follows:

$$ f_i^{trim}(x) = \frac{1}{N_2 - N_1 + 1} \sum_{m=N_1}^{N_2} f_i^{m:N}(x) = p_i(x) + \eta_i^{trim}(x), \tag{32} $$

where:

$$ \eta_i^{trim}(x) = \frac{1}{N_2 - N_1 + 1} \sum_{m=N_1}^{N_2} \eta_i^{m:N}(x). $$

The variance of $\eta_i^{trim}(x)$ is given by:

$$ \sigma^2_{\eta_i^{trim}} = \frac{1}{(N_2 - N_1 + 1)^2} \sum_{l=N_1}^{N_2} \sum_{m=N_1}^{N_2} cov(\eta_i^{m:N}(x), \eta_i^{l:N}(x)) $$

$$ = \frac{1}{(N_2 - N_1 + 1)^2} \left( \sum_{m=N_1}^{N_2} \sigma^2_{\eta_i^{m:N}(x)} + \sum_{m=N_1}^{N_2} \sum_{l>m}^{N_2} 2 \, cov(\eta_i^{m:N}(x), \eta_i^{l:N}(x)) \right). \tag{33} $$

Again, using the factors in Tables 1 and 2, Equation 33 can be further simplified. Note
that because the Gaussian distribution is symmetric, the covariance between the kth
and lth ordered samples is the same as that between the (N + 1 - k)th and (N + 1 - l)th
ordered samples. Therefore, Equation 33 leads to:

$$ \sigma^2_{\eta_i^{trim}} = \frac{1}{(N_2 - N_1 + 1)^2} \sum_{m=N_1}^{N_2} \alpha_{m:N} \, \sigma^2_{\eta_i^m(x)} + \frac{2}{(N_2 - N_1 + 1)^2} \sum_{m=N_1}^{N_2} \sum_{l>m}^{N_2} B_{m,l:N} \, \sigma^2_{\eta_i^m(x)}, \tag{34} $$

where $\alpha_{m:N}$ is the variance of the mth ordered sample and $B_{m,l:N}$ is the covariance
between the mth and lth ordered samples, given that the initial samples had unit
variance [50]. Using the theory highlighted in Section 2, and Equation 34, we obtain
the following model error reduction:

$$ \frac{E^{trim}_{model}}{E^m_{model}} = \frac{1}{(N_2 - N_1 + 1)^2} \left( \sum_{m=N_1}^{N_2} \alpha_{m:N} + 2 \sum_{m=N_1}^{N_2} \sum_{l>m}^{N_2} B_{m,l:N} \right). \tag{35} $$

Based on Equation 35 and Tables 1 and 2 we have generated a sample trim combiner
reduction table (Table 4). Because there are many possibilities for $N_1$ and $N_2$, a table that
exhaustively provides all reduction values is not practical. In this sample table we have
selected $N_1 = 2$ and $N_2 = N - 1$, that is, averaging after the lowest and highest values
have been removed. For comparison purposes the reduction factors of the averaging
combiner for N and N - 2 classifiers are also provided (for i.i.d. classifiers the reduction
factors are 1/N as derived in [?]; similar results were obtained for regression problems [?]).
As these numbers demonstrate, although N - 2 classifiers are used in the
trim combiner, selectively weeding out undesirable classifiers provides reduction factors
significantly better than simply averaging N - 2 arbitrary classifiers. The trim combiner
provides reduction factors comparable to the N classifier ave combiner without
being susceptible to corruption by one particularly faulty classifier.
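The trim reduction factor can be recomputed from the tabulated Gaussian variances and covariances. The sketch below does so for N = 4, 5 and 6 with the lowest and highest outputs removed; it uses the Table 1 and Table 2 entries together with the symmetry of the Gaussian distribution, and the dictionary and function names are illustrative.

```python
# Variance factors from Table 1 (only k <= (N + 1) / 2 is stored).
ALPHA = {
    4: [0.492, 0.360],
    5: [0.448, 0.312, 0.287],
    6: [0.416, 0.280, 0.246],
}

# Covariance factors from Table 2, keyed by (k, l) with k < l.
B = {
    4: {(1, 2): 0.246, (1, 3): 0.158, (1, 4): 0.105, (2, 3): 0.236},
    5: {(1, 2): 0.224, (1, 3): 0.148, (1, 4): 0.106, (1, 5): 0.074,
        (2, 3): 0.208, (2, 4): 0.150},
    6: {(1, 2): 0.209, (1, 3): 0.139, (1, 4): 0.102, (1, 5): 0.077,
        (1, 6): 0.056, (2, 3): 0.189, (2, 4): 0.140, (2, 5): 0.106,
        (3, 4): 0.183},
}


def alpha(m, n):
    """Variance factor of the m-th ordered sample out of n."""
    m = min(m, n + 1 - m)
    return ALPHA[n][m - 1]


def cov(m, l, n):
    """Covariance factor between the m-th and l-th ordered samples (m < l)."""
    if (m, l) in B[n]:
        return B[n][(m, l)]
    # Symmetric distribution: (m, l) mirrors to (n + 1 - l, n + 1 - m).
    return B[n][(n + 1 - l, n + 1 - m)]


def trim_factor(n, n1, n2):
    """Error reduction factor of the trim combiner over ranks n1..n2."""
    ranks = range(n1, n2 + 1)
    total = sum(alpha(m, n) for m in ranks)
    total += 2.0 * sum(cov(m, l, n) for m in ranks for l in ranks if l > m)
    return total / (n2 - n1 + 1) ** 2


if __name__ == "__main__":
    for n in (4, 5, 6):
        print(f"N={n}: trim={trim_factor(n, 2, n - 1):.4f}  ave(N)={1.0 / n:.4f}")
```

As a consistency check, setting N1 = 1 and N2 = N keeps every classifier and recovers the averaging factor 1/N up to rounding in the tabulated values.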

4.2.2 Trimmed mean Combiner for Biased Classifiers: 

Now, if the classifier biases are non-zero, the trimmed mean combiner's output is given
by:

$$ f_i^{trim}(x) = \frac{1}{N_2 - N_1 + 1} \sum_{m=N_1}^{N_2} f_i^{m:N}(x) = p_i(x) + (\beta_i + \eta_i(x))^{trim}. \tag{36} $$
Table 4: Error Reduction Factors for Trim and two corresponding ave Combiners with
Gaussian Noise Model.

  N    ave (for N)    trim (for N_1 = 2; N_2 = N - 1)    ave (for N - 2)
  3    .333           .449                               1.00
  4    .250           .298                               .500
  5    .200           .227                               .333
  6    .167           .184                               .250
  7    .143           .155                               .200
  8    .125           .134                               .167
  9    .111           .113                               .143


In that case the boundary offset is given by:

$$ b^{trim} = \frac{(\beta_i + \eta_i(x_b))^{trim} - (\beta_j + \eta_j(x_b))^{trim}}{s}. \tag{37} $$

The first moment of $b^{trim}$ can be obtained in a manner similar to that of the
spread combiner. Indeed, each mean offset introduced by a specific order statistic for
class i will be offset by the one introduced for class j. Only the trimmed mean of the
biases will remain, giving the first moment of $b^{trim}$:

$$ \bar{\beta}^{trim} = \frac{1}{N_2 - N_1 + 1} \sum_{m=N_1}^{N_2} \frac{\bar{\beta}_i^{m:N} - \bar{\beta}_j^{m:N}}{s}. \tag{38} $$

In deriving the variance of $b^{trim}$, we follow the same steps as in Sections 3.3 and
4.1.2. The resulting boundary variance is similar to Equation 18, but since the
reduction is due to the linear combination of multiple ordered outputs, $\alpha$ is replaced by
A, where:

$$ A = \frac{1}{(N_2 - N_1 + 1)^2} \left( \sum_{m=N_1}^{N_2} \alpha_{m:N} + 2 \sum_{m=N_1}^{N_2} \sum_{l>m}^{N_2} B_{m,l:N} \right). \tag{39} $$

The model error reduction in this case is given by:

$$ E^{trim}_{model}(\beta) = \frac{s}{2} M_2^{trim} = \frac{s}{2} \left( \sigma^2_{b^{trim}} + (M_1^{trim})^2 \right) $$

$$ = \frac{s}{2} \left( A (\sigma^2_{b^m} + \sigma^2_{\beta^m}) + (\bar{\beta}^{trim})^2 \right) $$

$$ = A \, E^m_{model}(\beta) + \frac{s}{2} \left( A (\sigma^2_{\beta^m} - (\beta^m)^2) + (\bar{\beta}^{trim})^2 \right). \tag{40} $$

Once again we need to look at the interaction between the two parts of the error
reduction. The first term provides the error reduction compared to the model error of
an individual classifier. The smaller A is, the more error reduction there will be. In the
second term, on the other hand, a small value for A is only useful if the variability in
the individual biases is higher than the biases themselves ($\sigma^2_{\beta^m} > (\beta^m)^2$).



5 Experimental Results 

The order statistics-based combining methods proposed in this article are tailored for 
situations where: 
1. individual classifier performance is uneven and class dependent;

2. it is not possible (insufficient data, high amount of noise) to fine tune the individual 
classifiers without using computationally expensive methods. 

Such situations occur, for example, in electrical logging while drilling for oil, where 
data from certain well sites almost completely misses out on portions of the problem 
space, and in imaging from airborne platforms where the classifiers receive inputs from 
different satellites and/or different types of sensors (e.g., thermal, optical, SAR). While 
we have seen such data from Schlumberger, Austin, and NASA, Houston, unfortunately 
the data sets are not standard or public domain. So, in this article we restrict ourselves 
to public domain datasets and simulate such variability by using "early stopping", i.e.,
prematurely terminating the training of the individual classifiers (see footnote 4). Thus combining
results are first reported for the case where only half the classifiers are finely tuned. 
This procedure produces an artificially created quality variation in the pool of classifiers. 

For the experiments reported below, we used a multi-layer perceptron (MLP) with a 
single hidden layer, whose weights were randomly initialized for each run. All classifica- 
tion results reported in this article are test set error rates averaged over 20 runs, along 
with the 95% confidence intervals. Several types of simple combiners such as averag- 
ing, weighted averaging, voting, median, products, weighted products (Bayesian), using 
Dempster-Shafer theory of evidence, and entropy-based averaging, have been proposed
in the literature. However, on a wide variety of data sets, it has been observed that 
simple averaging usually provides results comparable to any of these techniques (and, 
surprisingly, often better than most of them) p6[ |59[ ]. For this reason, in this study, 
we use the average combiner as a representative of simple combiners, for comparison 
purposes. 
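The reporting convention used in the tables (a mean test error over 20 runs with a 95% confidence interval) can be sketched as follows. The article does not state how the intervals were computed, so the Student's t interval used here is an assumption, and the function name `mean_with_ci` is hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 19 degrees of freedom,
# i.e. for 20 runs. Using a t interval at all is an assumption.
T_975_DF19 = 2.093


def mean_with_ci(error_rates, t_crit=T_975_DF19):
    """Return (mean, half-width) so results can be printed as mean ± half-width."""
    n = len(error_rates)
    mean = statistics.fmean(error_rates)
    half_width = t_crit * statistics.stdev(error_rates) / math.sqrt(n)
    return mean, half_width


if __name__ == "__main__":
    # Made-up error rates (in %) for 20 runs, for illustration only.
    rates = [10.0] * 10 + [12.0] * 10
    mean, half_width = mean_with_ci(rates)
    print(f"{mean:.2f} ± {half_width:.2f}")
```

If a different number of runs is used, `t_crit` has to be replaced by the critical value for the matching degrees of freedom.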

The first two data sets (Tables 5 and 7) are based on underwater sonar signals. From
the original sonar signals of four different underwater objects (porpoise sound, cracking 
ice and two different whale sounds), two feature sets are extracted p4| : 

• WOC: a 25-dimensional feature set, consisting of Gabor wavelet coefficients, tem- 
poral descriptors and spectral measurements; and, 

• RDO: a 24-dimensional feature set, consisting of reflection coefficients based on 
both short and long time windows, and temporal descriptors. 

For both feature sets, an MLP with 50 hidden units was used. These data sets are
available at URL http://www.lans.ece.utexas.edu. Further details about this 4-class
problem can be found in [£4| £>9| .

The next six data sets (Tables 6 and 8) were selected from the Proben1/UCI
benchmarks [43]. The Proben1 benchmarks are particular training, validation and test
splits of the UCI data sets, which are available from URL
http://www.ics.uci.edu/~mlearn/MLRepository.html. The results presented in this article
are based on the first training, validation and test partition of these benchmarks, where
half the data is used for training, and a quarter each for validation and testing purposes.
Briefly, these data sets and the corresponding single-hidden-layer feed-forward neural
network architectures are (see footnote 5):



4 In all the experiments reported here, "high variability" among classifiers refers to classifiers being trained 
exactly half as long as the "fine tuned" classifiers. 

5 After deciding on a single hidden layered architecture, the number of hidden units was determined 
experimentally.

• Cancer: a 9-dimensional, 2-class data set based on breast cancer data [34], with
699 patterns; an MLP with 10 hidden units;

• Card: a 51-dimensional, 2-class data set based on credit approval decisions,
with 690 patterns; an MLP with 20 hidden units;

• Diabetes: an 8-dimensional data set with two classes based on personal data from
768 Pima Indians obtained from the National Institute of Diabetes and Digestive
and Kidney Diseases [54]; an MLP with 10 hidden units;

• Gene: a 120-dimensional data set with two classes, based on the detection of
splice junctions in DNA sequences [59], with 3175 patterns; an MLP with 20
hidden units;

• Glass: a 9-dimensional, 6-class data set based on the chemical analysis of glass 
splinters, with 214 patterns; an MLP with 15 hidden units; and, 

• Soybean: an 82-dimensional, 19-class problem with 683 patterns; an MLP 
with 40 hidden units. 
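The "half for training, a quarter each for validation and testing" partition can be made concrete with a little arithmetic. The sketch below is an illustration only: the rounding rule (round the training and validation shares up, give the remainder to the test set) is an assumption, and the function name `half_quarter_quarter_split` is hypothetical.

```python
def half_quarter_quarter_split(n_patterns):
    """Split sizes for a 50% / 25% / 25% partition of n_patterns.

    Assumed rounding: training and validation sizes are rounded up,
    and the test set receives whatever is left.
    """
    n_train = (n_patterns + 1) // 2
    n_val = (n_patterns + 3) // 4
    n_test = n_patterns - n_train - n_val
    return n_train, n_val, n_test


if __name__ == "__main__":
    # Pattern counts as listed above for each data set.
    for name, n in [("Cancer", 699), ("Card", 690), ("Diabetes", 768),
                    ("Gene", 3175), ("Glass", 214), ("Soybean", 683)]:
        print(name, half_quarter_quarter_split(n))
```

For the smaller sets, such as Glass with 214 patterns, this leaves only a few dozen patterns for validation and testing.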



Table 5: Combining Results in the Presence of High Variability in Individual Classifier 
Performance for the Sonar Data (% misclassified ± 95% confidence interval). 



Data           N   Ave            Max            Min            Spread         Trim (N1-N2)
RDO            4   11.57 ± .22    11.94 ± .25    11.52 ± .40    11.04 ± .19    11.34 ± .28 (3-4)
13.32 ± 1.66   8   11.64 ± .18    11.47 ± .22    11.29 ± .27    11.51 ± .18    12.30 ± .17 (4-5)
WOC            4   8.80 ± .18     7.84 ± .20     9.31 ± .24     8.54 ± .12     8.43 ± .26 (3-4)
12.07 ± 2.23   8   8.82 ± .17     7.68 ± .23     8.91 ± .13     8.24 ± .22     7.81 ± .16 (7-8)



Table 6: Combining Results in the Presence of High Variability in Individual Classifier
Performance for the Proben1/UCI Benchmarks (% misclassified ± 95% confidence interval).

Data           N   Ave            Max            Min            Spread         Trim (N1-N2)
Cancer         4   1.38 ± .13     1.38 ± .13     1.38 ± .13     1.38 ± .13     1.32 ± .13 (2-3)
1.49 ± .39     8   1.32 ± .12     1.44 ± .14     1.44 ± .14     1.44 ± .14     1.32 ± .12 (2-6)
Card           4   13.60 ± .22    13.37 ± .22    13.49 ± .21    13.37 ± .22    13.60 ± .15 (3-4)
14.33 ± .36    8   13.66 ± .19    13.08 ± .14    13.02 ± .14    12.97 ± .12    13.20 ± .18 (7-8)
Diabetes       4   25.26 ± .37    25.00 ± .46    25.00 ± .42    25.00 ± .42    25.26 ± .37 (3-4)
26.09 ± 1.27   8   24.84 ± .36    25.05 ± .33    25.05 ± .33    25.05 ± .33    24.84 ± .30 (6-8)
Gene           4   12.90 ± .23    12.90 ± .26    12.94 ± .25    12.66 ± .21    12.67 ± .22 (3-4)
15.01 ± .78    8   12.89 ± .22    12.76 ± .24    12.41 ± .10    12.43 ± .22    12.56 ± .20 (7-8)
Glass          4   33.77 ± .27    40.19 ± .72    33.21 ± .44    33.21 ± .44    33.77 ± .27 (2-3)
42.78 ± .75    8   33.96 ± .06    39.43 ± .27    33.77 ± .27    33.40 ± .41    33.77 ± .27 (1-6)
Soybean        4   7.76 ± .11     7.94 ± .14     12.88 ± .39    7.71 ± .15     7.82 ± .18 (3-4)
10.71 ± 1.69   8   7.65 ± .00     7.82 ± .13     13.41 ± .53    7.71 ± .15     7.65 ± .00 (4-8)


Tables 5 and 6 present the combining results for the underwater acoustic data sets and
the Proben1 benchmarks, respectively, when the individual classifier performance
was highly variable. The misclassification percentages for individual classifiers are
reported in the first column. For the trimmed mean combiner, we also provide N1 and
N2, the lower and upper cutting points in the ordered average used in Equation 32,
obtained through the validation set.
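A minimal sketch of how the cutting points might be chosen on a validation set is given below. It is not the authors' procedure: it assumes that the trimmed mean combiner averages, for each class, the N1-th through N2-th smallest classifier outputs and that the class with the largest combined value is selected, and it uses a plain exhaustive search over all pairs with N1 <= N2. The names `trim_combine`, `predict` and `select_cut_points` are hypothetical.

```python
from itertools import combinations_with_replacement


def trim_combine(outputs, n1, n2):
    """outputs[m][i] is classifier m's output for class i.
    For each class, average the n1-th through n2-th smallest outputs (1-based)."""
    n_classes = len(outputs[0])
    combined = []
    for i in range(n_classes):
        ordered = sorted(clf[i] for clf in outputs)
        kept = ordered[n1 - 1:n2]
        combined.append(sum(kept) / len(kept))
    return combined


def predict(scores):
    """Index of the class with the largest combined output."""
    return max(range(len(scores)), key=scores.__getitem__)


def select_cut_points(val_outputs, val_labels):
    """val_outputs[p][m][i]: output of classifier m for class i on pattern p.
    Returns (n1, n2, validation error) for the first pair with the fewest errors."""
    n = len(val_outputs[0])
    best = None
    for n1, n2 in combinations_with_replacement(range(1, n + 1), 2):
        errors = sum(predict(trim_combine(out, n1, n2)) != label
                     for out, label in zip(val_outputs, val_labels))
        if best is None or errors < best[0]:
            best = (errors, n1, n2)
    return best[1], best[2], best[0] / len(val_labels)


if __name__ == "__main__":
    # One validation pattern, three classifiers, two classes; the third
    # classifier is badly wrong. The numbers are made up for illustration.
    pattern = [[0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
    print(select_cut_points([pattern], [0]))
```

Ties between pairs are broken in favour of the pair encountered first, which is an arbitrary choice. Setting N1 = N2 recovers a single order statistic (min, median or max), and N1 = 1, N2 = N recovers simple averaging.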

On the Sonar data, the results indicate that when the individual classifier performance
is highly variable, order statistics-based combiners (particularly the spread combiner)
provide better classification results than simple combiners. This performance
improvement is obtained without sacrificing the simplicity of the combiner. On the
UCI/Proben1 benchmarks, the order statistics-based combiners provide better
classification performance on three of the six sets studied (no statistically significant
differences were detected among the various combiners in the remaining data sets). It is
important to note, however, that in all eight data sets studied, the order statistics-based
combiners performed at least as well as the simple combiner, suggesting that little risk is
taken by using this method.

A close inspection of these results reveals that using either the max or min combiner 
can provide better classification rates than ave, but it is difficult to determine which of 
the two will be more successful given a data set. A validation set may be used to select 
one over the other, but in that case, potentially precious training data is used solely 
for determining which combiner to use. The use of the spread combiner removes this 
dilemma by consistently providing results that are comparable to, or better than, the 
best of the max-min duo. It is important to note that the min combiner performs poorly 
on the Soybean data. Because this data set has 19 outputs, the posterior estimates of 
unlikely classes become extremely small and highly inaccurate. Basing decisions on 
such spurious values compromises the combiner's performance. Notice, however, that 
the spread combiner is not adversely affected by this phenomenon. 
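The behaviour described above can be illustrated with a toy example. The sketch below assumes the usual definitions (for each class, the max and min combiners take the largest and smallest classifier output, the spread combiner averages those two, and the class with the largest combined value wins); the function names and the numbers are made up for illustration and are not taken from the experiments.

```python
def combine(outputs, rule):
    """outputs[m][i] is classifier m's output for class i.
    Returns one combined score per class for the given rule."""
    n_classes = len(outputs[0])
    scores = []
    for i in range(n_classes):
        column = [clf[i] for clf in outputs]
        if rule == "ave":
            score = sum(column) / len(column)
        elif rule == "max":
            score = max(column)
        elif rule == "min":
            score = min(column)
        elif rule == "spread":
            score = 0.5 * (min(column) + max(column))
        else:
            raise ValueError(f"unknown rule: {rule}")
        scores.append(score)
    return scores


def decide(outputs, rule):
    """Index of the class with the largest combined score."""
    scores = combine(outputs, rule)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Three classifiers, three classes, true class 0. The third classifier
    # is poorly trained and gives nearly flat outputs.
    outputs = [
        [0.60, 0.35, 0.05],
        [0.55, 0.40, 0.05],
        [0.30, 0.33, 0.37],
    ]
    for rule in ("ave", "max", "min", "spread"):
        print(rule, decide(outputs, rule))
```

In this example the min combiner decides on the basis of the weakest classifier's small, unreliable values and picks the wrong class, while ave, max and spread still select class 0.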



Table 7: Combining Results with Fine-Tuned Classifiers for the Sonar Data (% misclassified
± 95% confidence interval).

Data           N   Ave            Max            Min            Spread         Trim (N1-N2)
RDO            4   9.26 ± .32     9.67 ± .20     9.45 ± .19     9.33 ± .20     9.28 ± .28 (2-3)
9.95 ± .36     8   8.94 ± .06     9.62 ± .16     9.36 ± .15     9.48 ± .18     8.92 ± .10 (1-6)
WOC            4   7.05 ± .12     7.31 ± .15     7.44 ± .17     7.31 ± .16     7.05 ± .16 (2-3)
7.47 ± .21     8   7.17 ± .08     7.19 ± .12     7.41 ± .16     7.22 ± .07     7.07 ± .10 (2-6)



When there is ample data, and all the classifiers are finely tuned (i.e., a validation set 
is used to determine the stopping time that yields the best generalization performance), 
simple combiners are expected to be adequate. However, it is not always possible 
to determine whether all conditions that lead to such an ideal situation are satisfied. 
Therefore, it is important to know whether the trimmed mean and spread combiners 
presented in this article perform worse than simple combiners under such conditions. 
To that end, we have combined finely tuned feed-forward neural networks using the
methods proposed in this article and compared the results with the traditional averaging
method. In this new set of experiments, all the conditions favor the averaging combiner
(i.e., all possible difficulties for the average combiner have been removed). The results
displayed in Tables 7 and 8 indicate that, even under such circumstances, both the
spread and trim combiners provide results that are comparable to those obtained by the
ave combiner. Furthermore, even under such conditions, the order statistics combiners
provide statistically significant improvements on two data sets.

Table 8: Combining Results with Fine-Tuned Classifiers for the Proben1/UCI Benchmarks
(% misclassified ± 95% confidence interval).

Data           N   Ave            Max            Min            Spread         Trim (N1-N2)
Cancer         4   0.69 ± .11     0.69 ± .11     0.69 ± .11     0.69 ± .11     0.69 ± .11 (2-3)
0.69 ± .11     8   0.69 ± .11     0.57 ± .01     0.57 ± .01     0.57 ± .01     0.57 ± .11 (7-8)
Card           4   13.14 ± .23    12.91 ± .11    13.02 ± .23    12.91 ± .11    13.14 ± .23 (2-3)
13.87 ± .36    8   13.14 ± .23    12.79 ± .01    12.79 ± .01    12.79 ± .01    12.80 ± .01 (7-8)
Diabetes       4   23.33 ± .29    23.23 ± .30    23.33 ± .24    23.23 ± .30    23.33 ± .29 (3-4)
23.52 ± .35    8   22.92 ± .23    23.23 ± .34    23.12 ± .34    23.23 ± .34    22.92 ± .23 (4-8)
Gene           4   12.41 ± .21    12.46 ± .24    12.51 ± .18    12.41 ± .17    12.41 ± .12 (3-4)
13.49 ± .21    8   12.26 ± .14    12.46 ± .18    12.16 ± .08    12.11 ± .19    12.16 ± .09 (1-6)
Glass          4   32.08 ± .01    32.45 ± .36    32.08 ± .01    32.08 ± .01    32.08 ± .01 (3-6)
32.26 ± .27    8   32.08 ± .01    32.08 ± .01    32.08 ± .01    32.08 ± .01    32.08 ± .01 (3-6)
Soybean        4   7.06 ± .00     7.18 ± .11     8.12 ± .77     7.06 ± .00     7.06 ± .00 (3-6)
7.36 ± .43     8   7.06 ± .00     7.18 ± .05     9.06 ± .82     7.06 ± .00     7.06 ± .00 (3-6)



6 Conclusion 

In this article we present and analyze combiners based on order statistics. These com- 
biners blend the simplicity of averaging with the generality of meta-learners. They are 
particularly effective if there are significant variations among component classifiers in 
at least some parts of the joint input-output space. Variations can arise when the indi- 
vidual training sets cannot be considered as random samples from a common universal 
data set. Examples of such cases include real-time data acquisition and classification 
from geographically distributed sources or data mining problems with large databases, 
where random subsampling is computationally expensive and practical methods lead to 
non-random subsamples. Furthermore, the robustness of order statistics combiners
is helpful when certain individual classifiers experience catastrophic failures (e.g.,
due to faulty sensors). 

The analytical framework provided in this paper quantifies the reductions in error 
achieved when an order statistics based ensemble is used. It also shows that the two 
methods for linear combination of order statistics introduced in this paper provide 
more reliable estimates of the true posteriors than any of the individual order statistic 
combiners. 

The experimental results of Section 5 indicate that when there is high variability 
among the classifiers, the order statistics-based combiners significantly outperform sim- 
ple combiners, whereas in the absence of such variability these combiners perform no 
worse. Thus the family of order statistic combiners is able to extract an appropriate 
amount of information from the individual classifier outputs without requiring the tuning
of additional parameters as in meta-learners, and without being substantially affected by
outliers.

A future endeavor, which will be helpful for this work as well as for the study of
classification based on very large datasets in general, is to obtain a suite of public
domain datasets which are intrinsically partitioned into segments with varying quality.
Though such situations sometimes occur in practice (for example, in oil logging data
pC| ] and mortgage scoring, both data sets being proprietary), they are not represented
in the standard, venerable databases such as UCI, ELENA and Statlog typically used 
by the academic community. Perhaps the recent CRoss-Industry Standard Process for 
Data Mining (CRISP-DM) initiative will provide a satisfactory solution to this problem 
in the near future. 